
fix: unify ColumnNotFound for duckdb and pyspark #2493

Open · wants to merge 42 commits into main

Conversation

@EdAbati (Collaborator) commented May 4, 2025

What type of PR is this? (check all applicable)

  • πŸ’Ύ Refactor
  • ✨ Feature
  • πŸ› Bug Fix
  • πŸ”§ Optimization
  • πŸ“ Documentation
  • βœ… Test
  • 🐳 Other

Related issues

Checklist

  • Code follows style guide (ruff)
  • Tests added
  • Documented the changes

If you have comments or can explain your changes, please do so below

@EdAbati (Collaborator, Author) commented May 4, 2025

I think I can do some more clean-up of repetitive code. I'll try tomorrow morning.

@EdAbati marked this pull request as ready for review on May 5, 2025, 07:04
@EdAbati (Collaborator, Author) commented May 5, 2025

I made a follow-up PR #2495 with the cleanup :)

@MarcoGorelli (Member) left a comment

thanks for working on this! just got a comment on the .columns usage

@@ -186,7 +187,14 @@ def from_column_names(
         context: _FullContext,
     ) -> Self:
         def func(df: DuckDBLazyFrame) -> list[duckdb.Expression]:
-            return [col(name) for name in evaluate_column_names(df)]
+            col_names = evaluate_column_names(df)
+            missing_columns = [c for c in col_names if c not in df.columns]
Member:

df.columns comes with overhead unfortunately, I think we should avoid calling it where possible. How much overhead depends on the operation

I was hoping we could do something like we do for Polars. That is to say, when we do select / with_columns, we wrap them in try/except, and in the except block we intercept the error message to give a more useful / unified one
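
A rough sketch of that pattern for the DuckDB backend (illustrative only, not the PR's code; the helper name and the exact exception class DuckDB raises are assumptions). The nice part is that .columns is only consulted on the error path, where the extra cost doesn't matter:

import duckdb

from narwhals.exceptions import ColumnNotFoundError


def select_with_unified_error(
    rel: duckdb.DuckDBPyRelation, *column_names: str
) -> duckdb.DuckDBPyRelation:
    # Sketch: wrap the backend call and re-raise a unified error on failure.
    try:
        return rel.select(*column_names)
    except duckdb.BinderException as e:  # assumed to be what DuckDB raises for unknown columns
        # Only pay for `.columns` once we already know we're going to fail.
        available = rel.columns
        missing = [c for c in column_names if c not in available]
        raise ColumnNotFoundError(
            f"The following columns were not found: {missing}"
            f"\n\nHint: Did you mean one of these columns: {available}?"
        ) from e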

@EdAbati (Collaborator, Author), May 5, 2025:

Ah interesting, I was not aware πŸ˜•

What is happening in the background in duckdb that causes this overhead? Do you have a link to the docs? (Just want to learn more.)

Also, is it a specific caveat of duckdb? I don't think we should worry about that in spark-like but I might be wrong

I will update the code tonight anyway (but of course feel free to add commits to this branch if you need it for today's release)

Member:

> df.columns comes with overhead unfortunately, I think we should avoid calling it where possible. How much overhead depends on the operation

@MarcoGorelli could we add that to (#805) and put more of a focus towards it? πŸ™

Member:

I don't think it's documented, but evaluating .columns may sometimes require doing a full scan. Example:

In [48]: df = pl.DataFrame({'a': rng.integers(0, 10_000, 100_000_000), 'b': rng.integers(0, 10_000, 100_000_000)})

In [49]: rel = duckdb.table('df')
100% β–•β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–

In [50]: rel1 = duckdb.sql("""pivot rel on a""")

In [51]: %timeit rel.columns
385 ns Β± 7.62 ns per loop (mean Β± std. dev. of 7 runs, 1,000,000 loops each)

In [52]: %timeit rel1.columns
585 ΞΌs Β± 3.8 ΞΌs per loop (mean Β± std. dev. of 7 runs, 1,000 loops each)

Granted, we don't have pivot in the Narwhals lazy API, but a pivot may appear in the history of the relation which someone passes to nw.from_native, and the output schema of pivot is value-dependent (😩 )

The same consideration should apply to spark-like
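
For reference, a self-contained sketch of the session above (assumes duckdb, numpy, and polars are installed; sizes are scaled down and exact timings will differ by machine):

import timeit

import duckdb
import numpy as np
import polars as pl

rng = np.random.default_rng()
df = pl.DataFrame(
    {
        "a": rng.integers(0, 10_000, 1_000_000),
        "b": rng.integers(0, 10_000, 1_000_000),
    }
)

rel = duckdb.sql("SELECT * FROM df")  # plain scan: schema is known statically
rel1 = duckdb.sql("PIVOT rel ON a")   # pivot: output schema depends on the values in 'a'

print(timeit.timeit(lambda: rel.columns, number=100))   # cheap metadata lookup
print(timeit.timeit(lambda: rel1.columns, number=100))  # may need to scan the data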

Member:

How do those timings compare to other operations/metadata lookups on the same tables?

Member:

.alias for example is completely non-value-dependent, so that stays fast

In [60]: %timeit rel.alias
342 ns Β± 2.3 ns per loop (mean Β± std. dev. of 7 runs, 1,000,000 loops each)

In [61]: %timeit rel1.alias
393 ns Β± 2.6 ns per loop (mean Β± std. dev. of 7 runs, 1,000,000 loops each)

@EdAbati added the pyspark, pyspark-connect, and error reporting labels on May 6, 2025
try:
    return self._with_native(self.native.select(*new_columns_list))
except AnalysisException as e:
    msg = f"Selected columns not found in the DataFrame.\n\nHint: Did you mean one of these columns: {self.columns}?"
@EdAbati (Collaborator, Author):

Not 100% sure about this error message. I don't think we can access the missing column names at this level; am I missing something?

Member:

I think what you've written is great: even though we can't access them, we can still try to be helpful.

@EdAbati (Collaborator, Author):

I split the test into lazy and eager variants to simplify the if-else statements a bit. I hope it is a bit more readable?

    return df

if constructor_id == "polars[lazy]":
    msg = r"^e|\"(e|f)\""
@EdAbati (Collaborator, Author), May 9, 2025:

Before, it was msg = "e|f". Now it is a bit stricter.
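
For illustration (the sample messages below are made up, not actual backend output), the difference in strictness is easy to see with plain re.search, which is what pytest.raises(match=...) uses under the hood:

import re

pattern = r"^e|\"(e|f)\""

# Matches a Polars-lazy-style message that starts with the missing column name...
assert re.search(pattern, "e\n\nResolved plan until failure: ...")
# ...or a message that quotes the missing column.
assert re.search(pattern, 'column "f" not found')

# The old pattern "e|f" also matched completely unrelated text; the new one doesn't.
assert re.search("e|f", "some unrelated message")
assert re.search(pattern, "some unrelated message") is None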

Comment on lines +105 to +106
with pytest.raises(ColumnNotFoundError, match=msg):
    df.select(nw.col("fdfa"))
@EdAbati (Collaborator, Author):

Before, this was not tested for Polars.

    constructor_lazy: ConstructorLazy, request: pytest.FixtureRequest
) -> None:
    constructor_id = str(request.node.callspec.id)
    if any(id_ == constructor_id for id_ in ("sqlframe", "pyspark[connect]")):
@EdAbati (Collaborator, Author), May 9, 2025:

sqlframe and pyspark.connect raise errors at collect. πŸ˜•

I need to double-check pyspark.connect. I currently cannot set it up locally... Working on it ⏳

Do you have an idea on how to deal with these?
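
One possible way to handle it in the test, sketched below (constructor_id, df, nw, and msg are the ones from the test above; matching on "fdfa" in the backend's own error text is an assumption):

import pytest

import narwhals as nw
from narwhals.exceptions import ColumnNotFoundError

if constructor_id in ("sqlframe", "pyspark[connect]"):
    # These backends only surface the bad column when the plan is executed,
    # so accept the backend's own exception at collection time.
    with pytest.raises(Exception, match="fdfa"):
        df.select(nw.col("fdfa")).collect()
else:
    # Everyone else raises the unified error as soon as the selection is built.
    with pytest.raises(ColumnNotFoundError, match=msg):
        df.select(nw.col("fdfa"))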

@EdAbati changed the title from "fix: unify ColumnNotFound for duckdb and pyspark/sqlframe" to "fix: unify ColumnNotFound for duckdb and pyspark" on May 9, 2025
Comment on lines -115 to -120
df.drop(selected_columns, strict=True).collect()
else:
@EdAbati (Collaborator, Author):

drop should already be tested in drop_test.py.

Comment on lines +50 to +53
msg = (
    r"The following columns were not found: \[.*\]"
    r"\n\nHint: Did you mean one of these columns: \['a', 'b'\]?"
)
@EdAbati (Collaborator, Author):

We use parse_columns_to_drop, and therefore raise our ColumnNotFoundError.from_missing_and_available_column_names(missing_columns=missing_columns, available_columns=cols) for every backend.
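
Roughly, that check looks something like the sketch below (illustrative only, not the exact implementation; the function name is hypothetical):

from collections.abc import Sequence

from narwhals.exceptions import ColumnNotFoundError


def parse_columns_to_drop_sketch(
    columns: Sequence[str], available_columns: Sequence[str], *, strict: bool
) -> list[str]:
    # Validate up front so every backend raises the same ColumnNotFoundError,
    # instead of its own flavour of "column not found".
    if strict:
        missing_columns = [c for c in columns if c not in available_columns]
        if missing_columns:
            raise ColumnNotFoundError.from_missing_and_available_column_names(
                missing_columns=missing_columns, available_columns=available_columns
            )
        return list(columns)
    # A non-strict drop silently ignores columns that don't exist.
    return [c for c in columns if c in available_columns]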

@EdAbati (Collaborator, Author) commented May 23, 2025

I finally had a few minutes to go back to this πŸ₯²

A few notes:

  • catch_{}_exception made it cleaner, thanks for the tip
  • sqlframe is tricky because it raises at collect time. Also, the error will be different based on the backend. Do we have a way to know which type of sqlframe backend we are dealing with? For duckdb we could use catch_duckdb_exception, but we need to make sure duckdb is available. Any ideas? Could we think about it in a follow-up?
  • regarding simple_select, it should already be covered by the below. We are already testing it too:

    def select(
        self, *exprs: IntoExpr | Iterable[IntoExpr], **named_exprs: IntoExpr
    ) -> Self:
        flat_exprs = tuple(flatten(exprs))
        if flat_exprs and all(isinstance(x, str) for x in flat_exprs) and not named_exprs:
            # fast path!
            try:
                return self._with_compliant(
                    self._compliant_frame.simple_select(*flat_exprs)
                )
            except Exception as e:
                # Column not found is the only thing that can realistically be raised here.
                available_columns = self.columns
                missing_columns = [x for x in flat_exprs if x not in available_columns]
                raise ColumnNotFoundError.from_missing_and_available_column_names(
                    missing_columns, available_columns
                ) from e

    Would you like to do something else for lazy backends?
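
For context, a hypothetical end-user view of what the unified behaviour aims for (a sketch only, not taken from the PR; duckdb is used just as an example backend):

import duckdb

import narwhals as nw
from narwhals.exceptions import ColumnNotFoundError

rel = duckdb.sql("SELECT 1 AS a, 2 AS b")
lf = nw.from_native(rel)

try:
    # Selecting a missing column should surface Narwhals' unified error,
    # not a backend-specific exception.
    lf.select("c").collect()
except ColumnNotFoundError as exc:
    print(exc)  # expected to mention 'c' and hint at the available columns ['a', 'b']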

@EdAbati (Collaborator, Author) left a comment

There are a couple of unrelated errors I'll check later.

See #2593 and #2594 (thanks @MarcoGorelli).

Labels
error reporting, pyspark, pyspark-connect
Projects
None yet
Development

Successfully merging this pull request may close these issues.

error reporting: unify "column not found" error message for DuckDB / spark-like
3 participants